fix: extract Docling async markdown result #3031
Force-pushed from 3c02f33 to 606ed09.
Rebased this onto the latest base branch. Validation was run on the rebased head.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 606ed098cd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Force-pushed from 436945c to 6a69155.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6a69155b8e
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 27a31e4400
    except Exception:
        return text
    content_val = get_by_path(payload, content_field)
    return _string_content(content_val) or text
Reject empty Docling content instead of indexing envelope
When Docling returns a JSON result whose configured content field is present but empty (for example a blank document, a skipped conversion, or a conversion result with document.md_content: ""), this fallback returns the entire JSON envelope as the document text. That bypasses the existing "Docling parser returned empty content" guard in parse_docling and can index status, errors, or metadata instead of extracted markdown. Distinguish "JSON parsed but content is empty or missing" from "non-JSON raw text" here rather than falling back to text.
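One way to draw that distinction, as a minimal sketch rather than the PR's actual code: `EmptyDoclingContent`, `extract_content`, and this `get_by_path` are hypothetical stand-ins written for illustration, and treating non-object JSON as raw text is an assumption.

```python
import json
from typing import Any


class EmptyDoclingContent(ValueError):
    """Hypothetical error: the body parsed as JSON but the content field was empty or missing."""


def get_by_path(payload: Any, path: str) -> Any:
    """Hypothetical stand-in for the helper in the diff: follow a dotted path through dicts."""
    current = payload
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_content(text: str, content_field: str = "document.md_content") -> str:
    try:
        payload = json.loads(text)
    except ValueError:
        # Not JSON at all: the service returned raw text, so pass it through.
        return text
    if not isinstance(payload, dict):
        # Assumption: a JSON scalar or list is not a Docling envelope; treat it as raw text.
        return text
    content_val = get_by_path(payload, content_field)
    if isinstance(content_val, str) and content_val.strip():
        return content_val
    # JSON envelope without usable content: fail loudly instead of indexing the envelope.
    raise EmptyDoclingContent(f"Docling field {content_field!r} is empty or missing")
```

With this shape, a caller such as parse_docling could catch the error (or let it propagate) and apply its existing empty-content handling, while plain-text responses keep working as before.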
    # DOCLING_CONTENT_FIELD=document.md_content
    # DOCLING_FILE_FIELD=files
    # DOCLING_SUCCESS_VALUES=done,success,completed
    # DOCLING_FAILED_VALUES=failed,error,cancelled
Include Docling's failure status in the sample override
If a user uncomments this Docling block, DOCLING_FAILED_VALUES overrides the code default, which now includes failure. Docling v1 reports failed async tasks as task_status: "failure", so with this sample configuration a failed conversion is treated as still pending until DOCLING_MAX_POLLS expires, instead of promptly raising the parse-service failure. Add failure to the example list.
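The effect of the override can be illustrated with a small sketch. The `classify_status` function and the default lists below are assumptions made for this example (only the environment variable names and the sample values come from the diff); the real polling code may differ.

```python
import os
from typing import Mapping, Optional, Set

# Assumed defaults for illustration; the review only states that the code default includes "failure".
DEFAULT_SUCCESS = "done,success,completed"
DEFAULT_FAILED = "failed,failure,error,cancelled"


def _values(env: Mapping[str, str], name: str, default: str) -> Set[str]:
    """Read a comma-separated list from the environment, falling back to the default."""
    raw = env.get(name, default)
    return {v.strip().lower() for v in raw.split(",") if v.strip()}


def classify_status(status: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Map a task status to 'success', 'failed', or 'pending' (hypothetical helper)."""
    env = os.environ if env is None else env
    value = status.strip().lower()
    if value in _values(env, "DOCLING_SUCCESS_VALUES", DEFAULT_SUCCESS):
        return "success"
    if value in _values(env, "DOCLING_FAILED_VALUES", DEFAULT_FAILED):
        return "failed"
    # Anything unrecognized keeps the poller waiting.
    return "pending"
```

Under these assumptions, `classify_status("failure", env={"DOCLING_FAILED_VALUES": "failed,error,cancelled"})` falls through to the pending branch, which is the slow-failure behavior described above; adding failure to the list makes the same status classify as failed.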
Force-pushed from 27a31e4 to b057844.
Rebased onto the latest dev branch and updated the Chinese file-processing doc in the new structure. The Docling quick-start endpoint now matches the current 5001 examples, and the obsolete duplicated full_docs section from the old layout is gone. Validation:
Summary
Verified locally
Note: running the full tests/test_pipeline_release_closure.py file on Windows still hits two existing path-separator assertions unrelated to this change.
Closes #2996